ABSTRACT

Many applications rely on Web data and extraction systems to accomplish
knowledge-driven tasks. Web information is not curated, so many sources
provide inaccurate or conflicting information. Moreover, extraction systems
introduce additional noise to the data. We wish to automatically distinguish
correct data from erroneous data in order to create a cleaner set of
integrated data. Previous work has shown that a naive voting strategy that
trusts data provided by the majority, or by at least a certain number of
sources, may not work well in the presence of copying between the sources.
However, correlation between sources can be much broader than copying:
sources may provide data from complementary domains (negative correlation),
extractors may focus on different types of information (negative
correlation), and extractors may apply common rules in extraction (positive
correlation, without copying). In this paper we present novel techniques
that model correlations between sources and apply them in truth finding. We
provide a comprehensive evaluation of our approach on three real-world
datasets with different characteristics, as well as on synthetic data,
showing that our algorithms outperform the existing state-of-the-art
techniques.

Categories and Subject Descriptors: 

H.3.5 [Online Information Services]: Data sharing
Keywords: data fusion; integration; correlated sources 

1. INTRODUCTION

The Web is an incredibly rich source of information, which is 
growing at an unprecedented pace and is amassed by a plethora 
of contributors. An increasing number of users and applications 
rely on online data as the main resource to satisfy their information
needs. Web data is not curated and sources may often provide
erroneous or conflicting information. Additionally, Web
data is largely unstructured, lacking a predefined schema or consistent
format. As a result, knowledge-driven applications in various
domains (e.g., finance, technology, advertisement, etc.) rely on 
information extraction systems to retrieve structured relations from 
online sources. However, extraction systems have less than perfect 
accuracy, invariably introducing more noise to the data. 

Our goal is to automatically distinguish correct data from erroneous
data in order to create a cleaner set of data. A naive approach to achieve
this goal is majority voting: we trust the data provided by the 
majority, or at least a certain number of sources. However, such 
a strategy may perform badly for two reasons. First, sources may 
provide data from complementary domains (e.g., information on 
scientific books vs. on biographies) and extractors may focus on 
different types of information (e.g., extracting from the Infobox or 
the texts of Wikipedia pages); blindly requiring agreement among 
sources may miss correct data and cause false negatives. Second, 
sources may easily copy and share data [2] and extractors may 
apply common rules; blindly trusting agreement among sources may 
reinforce erroneous data and cause false positives. Such correlation
or anti-correlation between sources makes it especially hard to tell 
the truth from wrong statements or extractions, as illustrated next. 

EXAMPLE 1.1. Figure 1 depicts example data extracted from
the Wikipedia page for Barack Obama, using five different extraction
systems. Extracted data consist of knowledge triples in the form of
(subject, predicate, object); for example, (Obama, spouse, Michelle)
states that the spouse of Obama is Michelle. Some extracted triples
are incorrect. For example, triple t_2 is false: extraction systems
S_1 and S_2 derived the triple from a sentence referring to Barack
Obama Sr., rather than the current US president.

Various types of correlations exist among the five sources. First,
S_1, S_4, and S_5 implement similar extraction patterns and extract
similar sets of triples; there is a positive correlation between these
sources. Second, S_3 extracts triples from the Infobox of the Wikipedia
page while S_1 (similarly, S_4 and S_5) extracts triples from the text;
their extracted triples are largely complementary and there is a
negative correlation between them.

Figure 1c shows the precision, recall, and F-measure of voting
techniques: Union-k accepts a triple as true if at least k% of the
extractors extract it. For example, Union-25 accepts triples provided by at
least 2 extractors: it has high recall (missing only one triple), but
makes many mistakes (accepting 4 false triples) because of the
common mistakes of the correlated sources S_1, S_4, and S_5. Union-75
accepts triples provided by at least 4 extractors; it misses many
true triples since S_3 is anti-correlated with three other sources.
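
As an illustration, the Union-k voting rule can be sketched in a few lines of Python. The function name `union_k` and the toy outputs are hypothetical and are not the data of Figure 1; the sketch only assumes that a triple is accepted when at least k% of the sources provide it.

```python
import math
from collections import Counter

def union_k(outputs, k):
    """Accept a triple if at least k% of the sources provide it.

    outputs: dict mapping a source name to the set of triples it provides.
    k: percentage threshold in (0, 100].
    """
    # Smallest number of sources that reaches k% of all sources.
    min_votes = math.ceil(len(outputs) * k / 100)
    votes = Counter(t for triples in outputs.values() for t in triples)
    return {t for t, n in votes.items() if n >= min_votes}

# Hypothetical toy outputs of five sources (NOT the data of Figure 1).
toy_outputs = {
    "A": {"x", "y", "z"},
    "B": {"x", "y"},
    "C": {"x", "w"},
    "D": {"x", "y", "z"},
    "E": {"x", "z"},
}

accepted_25 = union_k(toy_outputs, 25)  # needs at least 2 of 5 sources
accepted_75 = union_k(toy_outputs, 75)  # needs at least 4 of 5 sources
```

With five sources, the thresholds 25 and 75 translate to 2 and 4 votes respectively, matching the counts quoted in the example. The rule counts votes only, so it cannot tell whether agreeing sources are correlated.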

Data fusion has studied resolving conflicts while considering
source copying [5,6]. Previous approaches have two limitations.
First, they focus on copying of data between sources and are based
on the intuition that common mistakes are strong evidence of copying;
correlation is much broader: it can be positive or negative and
can be caused by different reasons. Previous approaches are effective
in detecting positive correlation on false data, but are not effective
with positive correlation on true data or negative correlation. Second,
their model relies on the single-truth assumption, such as everyone
having a unique birthplace; however, in practice there can be multiple
truths for certain “facts”; for example, someone may have multiple
professions (e.g., triples t_1, t_3, and t_10 in Figure 1 are all correct).

Figure 1: Example 1.1: (a) knowledge triples derived by 5 extractors, (b) extractor quality and correlations, (c) voting results.
(a) Data extracted by five different extractors from the Wikipedia page for Barack Obama. The marks in the table indicate which extraction systems
produce each knowledge triple; for example, t_3 is extracted by S_3, but not by any other extractor.
(b) Precision and recall for each extractor, and joint precision and joint
recall for some combinations of extractors.
(c) Naive fusion approaches based on voting do not achieve very
good results, as they do not account for correlations among the
extractors.

In this paper, we address the problem of finding truths among 
data provided by multiple sources, which may contain complex 
correlations. We make the following contributions. 

• We propose measuring the quality of a source as its precision and 
recall and measuring the correlation between a subset of sources 
as their joint precision and joint recall. We express them in terms 
of conditional probability (Section 2). 

• We present a novel technique that derives the probability of a 
triple being true from the precision and recall of the sources using 
Bayesian analysis under the independence assumption (Section 3). 
Our experiments show that even before incorporating correlations, 
our basic approach often outperforms existing state-of-the-art 
techniques. 

• We extend our approach to handle correlations between the sources. 
We first present an exact solution that is exponential in the number 
of data sources. We then present two approximation schemes: the 
aggressive approximation reduces the computational complexity
from exponential to linear, but sacrifices the accuracy of the
predictions; our elastic approximation provides a mechanism to 
trade efficiency for accuracy (Section 4). 

• We conduct a comprehensive evaluation of our techniques against 
three real-world data sets, as well as synthetic data. Our experiments
show that our methods can significantly improve the results
by considering correlation without adding too much overhead for 
efficiency (Section 5). 

2. THE FUSION PROBLEM 

In this section, we introduce our data model and its semantics, 
we provide a formal definition of the problem of fusing data from 
sources with unknown correlations, and we present a high-level 
overview of our approach. We summarize notations in Figure 2. 

2.1 Data model 

We consider a set of data sources S = {S_1, ..., S_n}. Each
source provides some data and we call each unit of data a triple;
a triple can be considered as a cell in a database table in the form
of (row-entity, column-attribute, value) (e.g., in a table about politicians,
a row can represent Obama, a column can represent attribute
profession, and the corresponding cell can have value president),
or an RDF triple in the form of (subject, predicate, object), such
as (Obama, profession, president). We denote with O_i the triples
provided by source S_i ∈ S; interchangeably, we denote with S_i |= t
or t ∈ O_i that S_i provides triple t. Our data model consists
of S = {S_1, ..., S_n} and the collections of their output triples
O = {O_1, ..., O_n}. In a slight abuse of notation, we write t ∈ O
to denote that ∃ O_i ∈ O such that t ∈ O_i. We use O_t to represent
the subset of outputs in O that involve triple t; note that O_t contains
the observation that a source S_i does not provide t only if S_i
provides other data in the domain of t, so we do not unnecessarily
penalize data missing from irrelevant sources.

We consider deterministic sources: a source either outputs a triple, 
or it does not. In practice, a source S_i ∈ S may provide a confidence
score associated with each triple t ∈ O_i; we can consider that S_i
outputs t if the assigned confidence score exceeds a certain threshold.
As in previous work [6,25], we assume that schema mapping and 
reference reconciliation have been applied so we can compare the 
triples across sources. 

Our goal is to purge the output of all incorrect triples to obtain
a high-quality data set R = {t : t ∈ O ∧ t is true}. We say
that a triple t is true if it is consistent with the real world, and
false otherwise; for example, (Obama, profession, president) is true
whereas (Obama, died, 1982) is false. We next show an instantiation
of our data model for the data extraction scenario.

EXAMPLE 2.1. Figure 1 shows triples extracted by five extractors
from the Wikipedia page for Barack Obama; we need to
determine which triples are correctly extracted. We consider that
each extractor corresponds to a source; for example, S_1 corresponds
to the first extractor and it provides (among others) triple
t_1: (Obama, profession, president). We denote this as S_1 |= t_1,
meaning that the extractor believes that t_1 is a fact that appears on
the Wikipedia page. Accordingly, O_1 = {t_1, t_2, t_6, t_7, t_8, t_9, t_10}.

Based on S and O, we decide whether each triple t_i is true
(i ∈ [1, 10]). In this scenario, the extractor input (the processed web
page) represents the “real world”, against which we evaluate the
correctness of the extractor outputs. We consider a triple to be true
(i.e., correctly extracted) if the web page indeed provides the triple.




Semantics: In this paper, we make two assumptions about the semantics
of the data. First, we assume triple independence: the truthfulness
of each triple is independent of that of other triples. For
example, whether the page indeed provides triple t_1 is independent
of whether the page provides triple t_2. Second, we assume open-world
semantics: a source considers any triple in its output as true,
and any triple not in its output as unknown (rather than false). For
example, in Figure 1, S_1 provides t_1 and t_2 but not t_3, meaning
that it considers t_1 and t_2 as being provided by the page, but does
not know whether t_3 is also provided. Note that this is in contrast
with the conflicting-triple, closed-world semantics in [6]; under
those semantics, (Obama, religion, Christian) and (Obama, religion,
Muslim) would be considered conflicting with each other, as we
typically assume one can have at most one religion and a source
claiming the former implicitly claims that the latter is false.

We make these assumptions for two reasons. The first reason
is that they are suitable for many application scenarios. One application
scenario is data extraction, as shown in our motivating
example: when an extractor derives two different triples from a Web
page (often from different sentences or phrases), the correctness
of the two extractions is independent; if an extractor does not derive
a triple from a Web page, it usually indicates that the extractor
does not know whether the page provides the triple, rather than
that it believes that the page does not provide the triple. Another
scenario is attributes that can accept multiple truths. For example,
a person can have multiple professions: the correctness of each
profession is largely independent of other professions (see footnote 1), and a source
that claims that Obama is a president does not necessarily claim
that Obama cannot be a lawyer. The second reason is that, to the
best of our knowledge, all previous work that studies correlation
of sources focuses on the conflicting-triple, closed-world semantics;
the independent-triple, open-world semantics allows us to fill
the gap in the existing literature. Note that we can apply strategies
for conflicting-triple and closed-world semantics in the case
of independent-triple and closed-world semantics, or in the case of
conflicting-triple and open-world semantics. We leave the combination
of all semantics for future work.

2.2 Measuring truthfulness 

The objective of our framework is to distinguish true and false
triples in a collection of source outputs. A key feature of our approach
is that it does not assume any knowledge of the inner workings
of the sources and how they derive the data that they provide.
First, this is indeed the case in practice for many real-world data
sources: they provide the data without telling us how they obtain it.
Second, even when some information on the data derivation process
is available, it may be too complex to reason about; for example,
an extractor often learns thousands of patterns or more (e.g., via
distant supervision [18]) and uses an internal encoding to represent them;
it is hard to understand all of them, let alone to reason about them
and compare them across sources.

Next, we show which key evidence we consider in our approach 
and then formally define our problem. 

Source quality 

The quality of the sources affects our belief in the truthfulness of a
triple. Intuitively, if a source S has high precision (i.e., most of its
provided triples are true), then a triple provided by S is more likely
to be true. On the other hand, if S has high recall (i.e., most of the
true triples are provided by S), then a triple not provided by S is
more likely to be false.

1 Arguably, it is unlikely for a person to be a doctor, a lawyer, and a plumber
at the same time as they require very different skills; we leave such joint
reasoning with a priori knowledge for future work.

Notation               Description
S                      Set of sources S = {S_1, ..., S_n}
O_i                    Set of output triples of source S_i
O                      Collection of the outputs of all sources, O = {O_1, ..., O_n}
O_t                    Subset of observations in O that refer to triple t
p_i (resp. p_S*)       Precision of source S_i (resp. sources S*)
r_i (resp. r_S*)       Recall of source S_i (resp. sources S*)
q_i (resp. q_S*)       False positive rate of S_i (resp. S*)
S_i |= t               S_i outputs t (t ∈ O_i)
S* |= t                ∀ S_i ∈ S*, S_i |= t
Pr(t | O)              Correctness probability of triple t
Pr(t), Pr(¬t)          Pr(t = true) and Pr(t = false) respectively

Figure 2: Summary of notations used in the paper.

We define precision and recall in the standard way: the precision
p_i of source S_i ∈ S represents the portion of triples in the output
O_i that are true; the recall r_i of S_i represents the portion of all true
triples that appear in O_i. These metrics can be described in terms of
probabilities as follows.

\[ p_i = \Pr(t \mid S_i \models t) \tag{1} \]

\[ r_i = \Pr(S_i \models t \mid t) \tag{2} \]

EXAMPLE 2.2. Figure 1b shows the precision and recall of the
five sources. For example, the precision of S_1 is 4/7 = 0.57, as only
4 out of the 7 triples in O_1 are correct. The recall is 4/6 = 0.67, as 4
out of the 6 correct triples are included in O_1.
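
The two definitions can be sketched as set operations. The triple identifiers below are placeholders chosen only so that the counts match the example (7 provided triples, 4 of them true, 6 true triples overall); they are not the actual triples of Figure 1, and the function names are hypothetical.

```python
def precision(output, true_triples):
    """Fraction of the triples in a source's output that are true."""
    return len(output & true_triples) / len(output)

def recall(output, true_triples):
    """Fraction of all true triples that appear in the source's output."""
    return len(output & true_triples) / len(true_triples)

# Placeholder data: the source provides 7 triples, 4 of which are true,
# and there are 6 true triples in total.
source_output = {"a", "b", "c", "d", "e", "f", "g"}
true_triples = {"a", "b", "c", "d", "h", "i"}

p = precision(source_output, true_triples)  # 4/7
r = recall(source_output, true_triples)     # 4/6
```

Both functions assume non-empty inputs; a source with an empty output has no defined precision under this definition.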

The recall of a source should be calculated with respect to the 
“scope” of its input. For example, if a source S provides information
only about Obama but not about Bush, we may penalize the recall
of S for providing only 1 out of the 3 professions of Obama, but 
should not penalize the recall of S for not providing any profession 
for Bush. For simplicity of presentation, in the rest of the paper we 
ignore the “scope” of each source in our discussion, but all of our 
techniques work with either version of recall calculation. 

Correlation 

Another key factor that can affect our belief of triple truthfulness 
is the presence of correlations between data sources. Intuitively, 
if we know that two sources S_i and S_j are near duplicates of
each other, and are thus positively correlated, the fact that both
provide a triple t should not significantly increase our belief that t
is true. On the other hand, if we know that two sources S_i and S_j are
complementary and have little overlap, and are thus negatively correlated,
the fact that a triple t is provided by one but not the other should 
not significantly reduce our belief that t is true. Note the difference 
between correlation and copying [6]: copying can be one reason 
for positive correlation, but positive correlation can be due to other 
factors, such as using similar extraction patterns or implementing 
similar algorithms to derive data, rather than copying. 

We use joint precision and joint recall to capture correlation
between sources. The joint precision of the sources in S*, denoted by
p_S*, represents the portion of triples in the output of all sources in
S* (i.e., the intersection of their outputs) that are correct; the joint
recall of S*, denoted by r_S*, represents the portion of all correct
triples that are output by all sources in S*. If we denote by S* |= t
that a triple t is output by all sources in S*, we can describe these
metrics in terms of probabilities as follows.

\[ p_{S^*} = \Pr(t \mid S^* \models t) \tag{3} \]

\[ r_{S^*} = \Pr(S^* \models t \mid t) \tag{4} \]




EXAMPLE 2.3. Figure 1b shows the joint precision and recall
for selected subsets of sources. Take the sources {S_1, S_4, S_5} as
an example. They provide similar sets of triples: they all provide
t_1, t_6, t_8, t_9, and t_10. Their joint precision is 3/5 = 0.6 and their
joint recall is 3/6 = 0.5. Note that if the sources were independent,
their joint recall would have been r_1 · r_4 · r_5 = 0.3, much lower
than the real one (0.5); this indicates positive correlation.

On the other hand, S_1 and S_3 have little overlap in their data:
they both provide triples t_1 and t_10. Their joint precision is 2/2 = 1
and their joint recall is 2/6 = 0.33. Note that if the sources were
independent, their joint recall would have been r_1 · r_3 = 0.45,
higher than the real one (0.33); this indicates negative correlation.
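
The informal comparison used in this example, observed joint recall against the product of individual recalls, can be sketched as follows. This is only an illustration of the intuition and not the formal definition of Section 4. The helper name `correlation_hint` and the tolerance are hypothetical, and the individual recalls are assumed to be 0.67 each, a value consistent with the products 0.3 and 0.45 quoted above.

```python
import math

def expected_joint_recall(recalls):
    """Joint recall the sources would have if they were independent."""
    return math.prod(recalls)

def correlation_hint(joint_recall, recalls, tol=0.01):
    """Compare observed joint recall with the independence baseline."""
    expected = expected_joint_recall(recalls)
    if joint_recall > expected + tol:
        return "positive"
    if joint_recall < expected - tol:
        return "negative"
    return "none"

# Assumed individual recalls of 0.67 for each source involved.
hint_s1_s4_s5 = correlation_hint(0.5, [0.67, 0.67, 0.67])  # baseline about 0.30
hint_s1_s3 = correlation_hint(0.33, [0.67, 0.67])          # baseline about 0.45
```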

We define positive and negative correlation formally in Section 4. 

Problem definition 

Our goal is to determine the truthfulness of each triple in O. We 
model the truthfulness of t as the probability that t is true, given the 
outputs of all sources; we denote this as Pr (t | O). We can accept a 
triple t as true if this probability is above 0.5, meaning that t is more 
likely to be true than to be false. As we assume the truthfulness of 
each triple is independent, we can compute the probability for each 
triple separately conditioned on the provided data regarding t; that 
is, Ot. We frame our problem statement based on source quality 
and correlation. For now we assume the source quality metrics 
and correlation factors are given as input; we discuss techniques to 
derive them shortly. We formally define the problem as follows: 

Definition 2.4 (Triple Truthfulness). Given (1) a set
of sources S = {S_1, ..., S_n}, (2) their outputs O = {O_1, ..., O_n},
and (3) the joint precision p_S* and joint recall r_S* of each subset of
sources S* ⊆ S, compute the probability for each output triple
t ∈ O, denoted by Pr(t | O_t).

Note that given a set S of n sources, there is a total of 2(2^n − 1)
joint precision and recall parameters. Since the input size is
exponential in the number of sources, even a polynomial algorithm 
will be infeasible in practice. We show in Section 4 how we can 
reduce the number of parameters we consider in our model and 
solve the problem efficiently. 
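
To make the parameter count concrete, the following sketch enumerates the non-empty subsets of a set of sources; each subset carries one joint precision and one joint recall. The function names are hypothetical.

```python
from itertools import combinations

def nonempty_subsets(sources):
    """Yield every non-empty subset of the given sources as a tuple."""
    for size in range(1, len(sources) + 1):
        yield from combinations(sources, size)

def num_parameters(n):
    """Joint precision plus joint recall for each non-empty subset."""
    return 2 * (2 ** n - 1)

five_sources = ["S1", "S2", "S3", "S4", "S5"]
subset_count = sum(1 for _ in nonempty_subsets(five_sources))  # 31 subsets
parameter_count = num_parameters(len(five_sources))            # 62 parameters
```

Even for the five extractors of the running example there are 62 parameters, and the count doubles with every additional source.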

2.3 Overview 

We start by studying the problem of triple probability computation 
under the assumption that sources are, indeed, independent. We will 
show that even in this case, there are challenges to overcome in order 
to derive the probability. We then extend our methods to account for 
correlations among sources. Here, we present an overview of some 
high-level intuitions that we apply in each of these two settings. 

Independent sources (Section 3) 

Fusion of data from multiple sources is challenging because the 
inner-workings of each source are not completely known. We 
present a method that uses source quality metrics (precision and 
recall) to derive the probability that a source provides a particular 
triple, and applies Bayesian analysis to compute the truthfulness of 
each triple. We describe how to derive the quality metrics if those are 
unknown. With this model, we are able to improve the F-measure 
to 0.86 (precision = 0.75, recall = 1) for the motivating example.

Correlated sources (Section 4) 

Sources are often correlated: they may copy data from each other, 
employ similar techniques in deriving the data, or analyze complementary
portions of the raw data sets. Correlations can be positive
or negative, and are generally unknown. We address two main 
challenges in the case of correlated sources. 


• Using correlations: We start by assuming that we know concrete 
correlations between sources. We will see that the main insight 
into revising the probability of triples is to determine how likely 
it is for a particular triple to have appeared in the output of a 
given subset of sources but not in the output of any other source. 
Further, we use the inclusion-exclusion principle to express the 
correctness probability of a triple using the joint precision and 
joint recall of subsets of sources. 

• Exponential complexity: The number of correlation parameters 
is exponential in the number of sources, which can make our 
computation infeasible. To counter this problem, we develop two 
approximation methods: our aggressive approximation reduces 
the computation from exponential to linear, but sacrifices accuracy;
our elastic approximation provides a mechanism to trade
efficiency for accuracy and improve the quality of our approximation
incrementally.

Considering correlations, we can further improve the F-measure to
0.91 (precision = 1, recall = 0.83) for our motivating example, which
is 18% higher than Union-50 (i.e., majority voting).

3. FUSING INDEPENDENT SOURCES 

In this section, we start with the assumption that the sources are 
independent. Our goal is to estimate the probability that an output 
triple t is true given the observed data: Pr(t | O_t). We describe a
novel technique to derive this probability based on the quality of 
each source (Sec. 3.1). Since these quality metrics are not always 
known in advance, we also describe how to derive them if we are 
given the ground truth on a subset of the extracted data (Sec. 3.2). 

3.1 Estimating triple probability 

Given the collection O of output triples of the sources, our objective
is to compute, for each t ∈ O, the probability that t is
true, Pr(t | O), based on the quality of each source. Due to the
independent-triple assumption, Pr(t | O) = Pr(t | O_t).

We use Bayes' rule to express Pr(t | O_t) based on the inverse
probabilities Pr(O_t | t) and Pr(O_t | ¬t), which represent the probability
of deriving the observed output data conditioned on t being
true or false, respectively. In addition, we denote the a priori probability
that t is true by Pr(t) = α.

\[ \Pr(t \mid O_t) = \frac{\alpha \Pr(O_t \mid t)}{\alpha \Pr(O_t \mid t) + (1 - \alpha) \Pr(O_t \mid \neg t)} \tag{5} \]

The denominator in the above expression is equal to Pr(O_t). The a
priori probability α can be derived from a training set (i.e., a subset
of the triples with known ground truth values, see Section 3.2).

We denote by S_t the set of sources that provide t, and by \bar{S}_t the
set of sources that do not provide t. Assuming that the sources are
independent, the probabilities Pr(O_t | t) and Pr(O_t | ¬t) can then
be expressed using the true positive rate, also known as sensitivity
or recall, and the false positive rate, also known as the complement
of specificity, of each source as follows:

\[ \Pr(O_t \mid t) = \prod_{S_i \in S_t} \Pr(S_i \models t \mid t) \prod_{S_i \in \bar{S}_t} \bigl(1 - \Pr(S_i \models t \mid t)\bigr) \tag{6} \]

\[ \Pr(O_t \mid \neg t) = \prod_{S_i \in S_t} \Pr(S_i \models t \mid \neg t) \prod_{S_i \in \bar{S}_t} \bigl(1 - \Pr(S_i \models t \mid \neg t)\bigr) \tag{7} \]

From Eq. (2), we know r_i = Pr(S_i |= t | t). We denote the
false positive rate by q_i = Pr(S_i |= t | ¬t) and describe how we
derive it in Section 3.2. Applying these to Eq. (5), we obtain the
following theorem.



Theorem 3.1 (Independent Sources). Given a set of independent
sources S = {S_1, ..., S_n}, the recall r_i and the false
positive rate q_i of each source S_i, the correctness probability of an
output triple t is

\[ \Pr(t \mid O_t) = \frac{1}{1 + \frac{1 - \alpha}{\alpha} \cdot \frac{1}{\mu}}, \quad \text{where} \quad \mu = \prod_{S_i \in S_t} \frac{r_i}{q_i} \prod_{S_i \in \bar{S}_t} \frac{1 - r_i}{1 - q_i} \tag{8} \]

Intuitively, we compute the correctness probability based on the
(weighted) contributions of each source for each triple. Each source
S_i has contribution r_i/q_i for a triple that it provides, and contribution
(1 − r_i)/(1 − q_i) for a triple that it does not provide. Given a triple t, we
multiply the corresponding contributions of all sources to derive μ,
and then compute the probability of the triple accordingly.

We say a source S_i is good if it is more likely to provide a true
triple than a false triple; that is, Pr(S_i |= t | t) > Pr(S_i |= t | ¬t)
(i.e., r_i > q_i). Thus, a good source has a positive contribution for a
provided triple: once it provides a triple, the triple is more likely
to be true; otherwise, the triple is more likely to be false.

Proposition 3.2. Let S' = S ∪ {S'} and O' = O ∪ {O'}, where S' on
the right-hand side is an additional source with output O'.

• If S' is a good source:

- If S' |= t, then Pr(t | O'_t) > Pr(t | O_t).

- If S' ⊭ t, then Pr(t | O'_t) < Pr(t | O_t).

• If S' is a bad source:

- If S' |= t, then Pr(t | O'_t) < Pr(t | O_t).

- If S' ⊭ t, then Pr(t | O'_t) > Pr(t | O_t).


EXAMPLE 3.3. We apply Theorem 3.1 to derive the probability
of t_2, which is provided by S_1 and S_2 but not by S_3, S_4, or S_5:

\[ \mu = \frac{r_1}{q_1} \cdot \frac{r_2}{q_2} \cdot \frac{1 - r_3}{1 - q_3} \cdot \frac{1 - r_4}{1 - q_4} \cdot \frac{1 - r_5}{1 - q_5} \]

Suppose we know that q_1 = 0.5, q_2 = 0.67, q_3 = 0.167, and
q_4 = q_5 = 0.33, and we know the recall as shown in Figure 1b.
Then we compute μ = 0.1. With α = 0.5, Theorem 3.1 gives
Pr(t_2 | O_{t_2}) = 0.09, so we correctly determine that t_2 is false.

However, assuming independence can lead to wrong results: t_8 is
provided by {S_1, S_2, S_4, S_5}, but not by S_3. Using Eq. (8) produces
μ = 1.6 and Pr(t_8 | O_{t_8}) = 0.62, but t_8 is in reality false.
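
The computation of Theorem 3.1 can be sketched as below. The false positive rates are those given in the example. The recall values are assumptions, since Figure 1b is not reproduced here: r_1 = 0.67 is stated in the text, while r_2 = 0.5 and r_3 = r_4 = r_5 = 0.67 are values consistent with the numbers reported in the examples. Function and variable names are hypothetical, and the sketch assumes 0 < q_i < 1 for every source.

```python
def mu(provided_by, recall, fpr):
    """Product of per-source contributions (the quantity mu of Theorem 3.1)."""
    result = 1.0
    for source in recall:
        if source in provided_by:
            result *= recall[source] / fpr[source]
        else:
            result *= (1 - recall[source]) / (1 - fpr[source])
    return result

def triple_probability(provided_by, recall, fpr, alpha=0.5):
    """Pr(t | O_t) for independent sources.

    Uses alpha*mu / (alpha*mu + 1 - alpha), which is algebraically the same
    as 1 / (1 + (1 - alpha)/alpha * 1/mu) but avoids dividing by mu == 0.
    """
    m = mu(provided_by, recall, fpr)
    return alpha * m / (alpha * m + (1 - alpha))

# Assumed recalls (see lead-in) and the false positive rates of the example.
recall = {"S1": 0.67, "S2": 0.5, "S3": 0.67, "S4": 0.67, "S5": 0.67}
fpr = {"S1": 0.5, "S2": 0.67, "S3": 0.167, "S4": 0.33, "S5": 0.33}

p_t2 = triple_probability({"S1", "S2"}, recall, fpr)
p_t8 = triple_probability({"S1", "S2", "S4", "S5"}, recall, fpr)
print(round(p_t2, 2), round(p_t8, 2))  # 0.09 0.62
```

With the 0.5 acceptance threshold, t_2 is rejected while t_8 is accepted, which reproduces the error that the independence assumption causes for t_8.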

3.2 Estimating source quality 

Theorem 3.1 uses the recall and false positive rate of each source 
to derive the correctness probability. We next describe how we 
compute them from a set of training data, where we know the 
truthfulness of each triple. Existing work [9] also relies on training 
data to compute source quality, while crowdsourcing platforms, 
such as Amazon Mechanical Turk, greatly facilitate the labeling 
process [17].

Computing the recall (r_i) relies on knowledge of the set of true
triples, which is typically unknown a priori. Since we only need
to decide truthfulness for each provided triple, we use the set of
true triples that are provided by at least one source in the training
data. Then, for each source S_i, i ∈ [1, n], we count the number
of true triples it provides and compute its recall according to the
definition. In our motivating example (Figure 1), there are 6 true
triples extracted by at least one extractor; accordingly, the recall of
S_1 is 4/6 = 0.67 since it provides 4 true triples.

However, we cannot compute the false positive rate (q_i) in a
similar way by considering only false triples in the training data. We
next illustrate the problem using an example.


EXAMPLE 3.4. Consider deriving the quality of S_1 from the
training set {t_1, ..., t_10}. Since the sources are all reasonably
good, only 4 out of 10 triples are false. If we compute q_1 directly
from the data, we have q_1 = 3/4 = 0.75. Since q_1 > r_1 = 0.67, we
would (wrongly) consider S_1 as a bad source.

Now suppose there is an additional source S_0 that provides 10
false triples t_11 to t_20 and we include it in the training data. We
would then compute q_1 = 3/14 = 0.21; suddenly, S_1 becomes a good
source and much more trustworthy than it really is.
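
The instability described in this example can be reproduced with a small sketch. The triple identifiers are placeholders and the function name is hypothetical; only the counts (3 false triples provided out of 4, then out of 14) come from the example.

```python
def naive_false_positive_rate(output, false_triples):
    """Fraction of the known false triples that the source provides."""
    return len(output & false_triples) / len(false_triples)

# Placeholder identifiers: the source provides 3 of the 4 false triples.
false_in_training = {"f1", "f2", "f3", "f4"}
source_false_output = {"f1", "f2", "f3"}
q_before = naive_false_positive_rate(source_false_output, false_in_training)

# Another source contributes 10 more false triples to the training data.
extra_false = {f"g{i}" for i in range(1, 11)}
q_after = naive_false_positive_rate(
    source_false_output, false_in_training | extra_false
)
```

The estimate drops from 0.75 to about 0.21 although the behavior of the source has not changed, which is why the estimate should not depend on what other sources provide.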


To address this issue, we next describe a way that derives the 
false positive rate from the precision and recall of a source. The 
advantage of this approach is that the precision of a source can 
be easily computed according to the training data and would not 
be affected by the quality of other sources. Using Bayes' rule on
Pr(t | S_i |= t), we obtain a formula similar to Eq. (5), and then we
apply the conditional probability expressions for p_i, r_i, and q_i.

\[ p_i = \Pr(t \mid S_i \models t) = \frac{\alpha \Pr(S_i \models t \mid t)}{\alpha \Pr(S_i \models t \mid t) + (1 - \alpha) \Pr(S_i \models t \mid \neg t)} = \frac{\alpha r_i}{\alpha r_i + (1 - \alpha) q_i} \]

\[ \Longrightarrow \quad q_i = \frac{\alpha}{1 - \alpha} \cdot \frac{1 - p_i}{p_i} \cdot r_i \]

For our example, we would compute the precision of S_1 as 4/7 =
0.57. Assuming α = 0.5, we can derive q_1 = 0.5/(1 − 0.5) · (1 − 0.57)/0.57 ·
0.67 = 0.5, implying that S_1 is a borderline source, with fairly low
quality (recall that r_1 = 0.67 > 0.5). Note that for q_i to be valid, it
needs to fall in the range of [0,1]. The next theorem formally states
the derivation and gives the condition for it to be valid.


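THEOREM 3.5. Let S_i, i ∈ [1, n], be a source with precision p_i
and recall r_i.

• If \( \alpha \le \frac{p_i}{p_i + r_i - p_i r_i} \), we have \( q_i = \frac{\alpha}{1 - \alpha} \cdot \frac{1 - p_i}{p_i} \cdot r_i \).

• If p_i > α, S_i is a good source (i.e., q_i < r_i).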
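
A minimal sketch of this derivation follows, assuming the relation q_i = α/(1 − α) · (1 − p_i)/p_i · r_i and treating a result above 1 as an invalid combination of inputs. The function names and error handling are hypothetical choices for illustration.

```python
def false_positive_rate(p, r, alpha):
    """Derive the false positive rate from precision p, recall r, and prior alpha.

    Raises ValueError when the inputs violate the validity condition
    alpha <= p / (p + r - p*r), i.e. when the derived rate would exceed 1.
    """
    if not (0 < p <= 1 and 0 <= r <= 1 and 0 < alpha < 1):
        raise ValueError("expected 0 < p <= 1, 0 <= r <= 1, 0 < alpha < 1")
    q = alpha / (1 - alpha) * (1 - p) / p * r
    if q > 1:
        raise ValueError("derived false positive rate exceeds 1")
    return q

def is_good_source(p, alpha):
    """A source is good (q < r) exactly when its precision exceeds the prior."""
    return p > alpha

# Values of the running example: precision 4/7, recall 4/6, prior 0.5.
q_1 = false_positive_rate(4 / 7, 4 / 6, 0.5)
good_1 = is_good_source(4 / 7, 0.5)
```

For the running example this yields a false positive rate of 0.5, below the recall of 0.67, so the source is classified as good, in line with the text.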


Finally, we show in the next proposition that a triple provided by 
a high-precision source is more likely to be true, whereas a triple 
not provided by a good, high-recall source is more likely to be false, 
which is consistent with our intuitions. 


Proposition 3.6. Let S' = S ∪ {S'} and O' = O ∪ {O'}.
Let S'' = S ∪ {S''} and O'' = O ∪ {O''}. The following hold.

• If r_S' = r_S'', p_S' > p_S'', and S' |= t and S'' |= t, then
Pr(t | O'_t) > Pr(t | O''_t).

• If p_S' = p_S'' > α, r_S' > r_S'', and S' ⊭ t and S'' ⊭ t, then
Pr(t | O'_t) < Pr(t | O''_t).

Comparison with LTM [25] 

The closest work to our independent model is the Latent Truth Model 
(LTM) [25]; it treats source quality and triple correctness as latent 
variables and constructs a graphical model, and performs inference 
using Gibbs sampling. LTM is similar to our approach in that (1) it 
also assumes triple independence and open-world semantics, and ( 2 ) 
its probability computation also relies on recall and false positive 
rate of each source. However, there are three major differences. 
First, it derives the correctness probability of a triple from the Beta 
distribution of the recall and false positive rate of its providers; 
our model applies Bayesian analysis to maximize the a posteriori 
probability. Using the Beta distribution enforces assumptions 
about the generative process of the data, and when this model does 
not fit the actual dataset, LTM has a disadvantage against our
non-parametric approach. Second, it computes the recall and false
positive rate of a source as the Beta distribution of the percentage
of provided true triples and false triples; our model derives false
positive rate from precision and recall to avoid being biased by 
very good sources or very bad sources. Third, it iteratively decides 
truthfulness of the triples and quality of the sources; we derive 
source quality from training data. 

We compare LTM with our basic approach experimentally, showing
that we have comparable results in general and sometimes better
results; we also show that the correlation model we will present in
the next section obtains considerably better results than LTM, which
assumes independence of sources.

4. FUSING CORRELATED SOURCES 

Theorems 3.1 and 3.5 summarize our approach for calculating
the probability of a triple based on the precision and recall of each
source. In this section, we extend the results to account for
correlations among sources. Before we proceed, we first show several
scenarios where considering correlation between sources can
significantly improve the results.

EXAMPLE 4.1. Consider a set of $n$ good sources $\mathcal{S} = \{S_1, \ldots, S_n\}$.
All sources in $\mathcal{S}$ have the same recall $r$ and false positive rate
$q$, $r > q$. Given a triple $t$ provided by all sources, Theorem 3.1
computes $\mu_{indep} = \left(\frac{r}{q}\right)^n$.

Scenario 1 (Source copying): Assume that all sources in $\mathcal{S}$ are
replicas. Ideally, we want to consider them as one source; indeed,
their joint recall is $r$ and joint false positive rate is $q$. Thus, we
compute $\mu_{corr} = \frac{r}{q} < \mu_{indep}$, which results in a lower probability
for $t$; in other words, a false triple would not get a high probability
just because it is copied multiple times.

Scenario 2 (Sources overlapping on true triples): Assume that all
sources in $\mathcal{S}$ derive highly overlapping sets of true triples but each
source makes independent mistakes (e.g., extractors that use different
patterns to extract the same type of information). Accordingly,
their joint recall is close to $r$ and their joint false positive rate is
$q^n$. Thus, we compute $\mu_{corr} \approx \frac{r}{q^n} > \mu_{indep}$, which results in a
higher probability for $t$; in other words, we will have much higher
confidence for a triple provided by all sources.

Scenario 3 (Sources overlapping on false triples): Consider the
opposite case: all sources have a high overlap on false triples but
each source provides true triples independently (e.g., extractors that
make the same kind of mistakes). In this case, the joint recall is
$r^n$ and the joint false positive rate is close to $q$. Thus, we compute
$\mu_{corr} \approx \frac{r^n}{q} < \mu_{indep}$, which results in a lower probability for $t$;
in other words, considering correlations results in a much lower
confidence for a common mistake.

Scenario 4 (Complementary sources): Assume that all sources are
nearly complementary: their overlapping triples are rare but highly
trustable (e.g., three extractors respectively focus on info-boxes,
texts, and tables that appear on a Wikipedia page). Accordingly, the
sources have low joint recall but very high joint precision; assume
their joint recall is $r' \ll r$, and their joint false positive rate is $q'$,
which is close to 0. Then, we compute $\mu_{corr} = \frac{r'}{q'} \approx \infty$; in other
words, we highly trust the triples provided by all sources.

Under the same scenario, consider a triple $t'$ provided by only one
source $S \in \mathcal{S}$. Considering the negative correlation, the probability
that a triple is provided only by $S$ is $r$ for true triples and $q$ for
false triples; thus, $\mu'_{corr} = \frac{r}{q} > \frac{r}{q} \cdot \left(\frac{1-r}{1-q}\right)^{n-1} = \mu'_{indep}$, which
results in higher probability for $t'$. In other words, considering the
negative correlation, the correctness probability of a triple won’t be
penalized if only a single source provides the triple.
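To make the four scenarios of Example 4.1 concrete, the sketch below plugs in illustrative numbers. The values n = 5, r = 0.6, q = 0.3 and the helper name `probability` are assumptions chosen for the illustration, and "close to r" (resp. "close to q") is approximated by r (resp. q) itself.

```python
def probability(mu, alpha=0.5):
    """Correctness probability 1 / (1 + (1 - alpha)/alpha * 1/mu)."""
    return 1.0 / (1.0 + (1.0 - alpha) / alpha / mu)


n, r, q = 5, 0.6, 0.3  # illustrative values only, with r > q

mu_indep = (r / q) ** n            # all sources treated as independent
mu_copy = r / q                    # Scenario 1: replicas act as one source
mu_true_overlap = r / q ** n       # Scenario 2: shared truths, independent errors
mu_false_overlap = r ** n / q      # Scenario 3: shared errors, independent truths

# Triple provided by a single source, under complementary sources (Scenario 4).
mu_single_corr = r / q
mu_single_indep = (r / q) * ((1 - r) / (1 - q)) ** (n - 1)

for name, mu in [
    ("independent", mu_indep),
    ("copying", mu_copy),
    ("overlap on true triples", mu_true_overlap),
    ("overlap on false triples", mu_false_overlap),
    ("single provider, correlated", mu_single_corr),
    ("single provider, independent", mu_single_indep),
]:
    print(f"{name:>30}: mu = {mu:9.3f}  Pr = {probability(mu):.3f}")
```

The ordering of the printed values mirrors the inequalities stated in the example; the absolute numbers depend entirely on the assumed n, r, and q.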

These scenarios exemplify the differences between our work and copy
detection in [5]. Copy detection can handle Scenario 1 appropriately;
in Scenarios 2 and 3 it may incorrectly conclude with copying and
compute lower probability for true triples; it cannot handle
anti-correlation in Scenario 4.

We first present an exact solution, described by Theorem 4.2.
However, exact computation is not feasible for problems involving
a large number of sources, as the number of terms in the computation
formula grows exponentially. In Section 4.2, we present an
aggressive approximation, which can be computed in linear time
by enforcing several assumptions, but may have low accuracy. Our
elastic approximation (Section 4.3) relaxes the assumptions gradually,
and can achieve both good efficiency and good results. Note
that we can compute joint precision and joint recall, and derive joint
false positive rate exactly the same way as we compute them for a
single source (Section 3.2).

4.1 Exact solution 

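Recall that Eq. (6) and (7) compute $\Pr(\mathcal{O}_t \mid t)$ and $\Pr(\mathcal{O}_t \mid \neg t)$
by assuming independence between the sources. Now, we show
how to compute them in the presence of correlations. Using $\mathcal{S}_t$ to
represent the set of sources that provide $t$, and $\mathcal{S}_{\bar{t}}$ to represent the
set of sources that do not provide $t$, we can express $\Pr(\mathcal{O}_t \mid t)$ as:

$$\Pr(\mathcal{O}_t \mid t) = \Pr\Big( \bigwedge_{S \in \mathcal{S}_t} S \models t \;\wedge \bigwedge_{S' \in \mathcal{S}_{\bar{t}}} S' \not\models t \;\Big|\; t \Big) \qquad (9)$$

We apply the inclusion-exclusion principle to rewrite the formula
using the joint recall of the sources:

$$\Pr(\mathcal{O}_t \mid t) = \sum_{\mathcal{S}^* \subseteq \mathcal{S}_{\bar{t}}} (-1)^{|\mathcal{S}^*|} \Pr(\{\mathcal{S}_t \cup \mathcal{S}^*\} \models t \mid t) = \sum_{\mathcal{S}^* \subseteq \mathcal{S}_{\bar{t}}} (-1)^{|\mathcal{S}^*|} \, r_{\mathcal{S}_t \cup \mathcal{S}^*} \qquad (10)$$

Note that when the sources are independent, Eq. (10) computes
exactly $\prod_{S_i \in \mathcal{S}_t} r_i \prod_{S_i \in \mathcal{S}_{\bar{t}}} (1 - r_i)$, which is equivalent to Eq. (6).
We compute $\Pr(\mathcal{O}_t \mid \neg t)$ in a similar way, using the joint false positive
rate of the sources, which can be derived from joint precision
and joint recall as we described in Theorem 3.5:

$$\Pr(\mathcal{O}_t \mid \neg t) = \sum_{\mathcal{S}^* \subseteq \mathcal{S}_{\bar{t}}} (-1)^{|\mathcal{S}^*|} \, q_{\mathcal{S}_t \cup \mathcal{S}^*} \qquad (11)$$

Theorem 4.2 extends Theorem 3.1 for the case of correlated
sources.


THEOREM 4.2. Given a set of sources $\mathcal{S} = \{S_1, \ldots, S_n\}$, the
joint recall and joint false positive rate for each subset of the sources,
the probability of a triple $t$ is $\Pr(t \mid \mathcal{O}) = \frac{1}{1 + \frac{1-\alpha}{\alpha} \cdot \frac{1}{\mu}}$, where

$$\mu = \frac{\Pr(\mathcal{O}_t \mid t)}{\Pr(\mathcal{O}_t \mid \neg t)} \qquad (12)$$

and $\Pr(\mathcal{O}_t \mid t)$, $\Pr(\mathcal{O}_t \mid \neg t)$ are computed by Eq. (10) and (11).


Corollary 4.3. Given a set $\mathcal{S} = \{S_1, \ldots, S_n\}$, where all
sources are independent, the correctness probabilities computed
using Theorems 3.1 and 4.2 are equal.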
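The inclusion-exclusion sums of the exact solution can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function names and the dictionary layout (joint rates keyed by `frozenset` of source ids) are hypothetical, and the sample joint values are the rounded numbers that the running example for triple $t_8$ (Example 4.4) takes as given.

```python
from itertools import combinations


def exact_likelihood(providers, non_providers, joint):
    """Sum over subsets S* of the non-providers of (-1)^|S*| * joint[S_t U S*]."""
    s_t = frozenset(providers)
    total = 0.0
    for size in range(len(non_providers) + 1):
        for subset in combinations(sorted(non_providers), size):
            total += (-1) ** size * joint[s_t | frozenset(subset)]
    return total


def exact_probability(providers, non_providers, joint_recall, joint_fpr, alpha=0.5):
    """Pr(t | O) from the ratio of the two inclusion-exclusion sums."""
    mu = (exact_likelihood(providers, non_providers, joint_recall)
          / exact_likelihood(providers, non_providers, joint_fpr))
    return 1.0 / (1.0 + (1.0 - alpha) / alpha / mu)


# Running example: S1, S2, S4, S5 provide the triple, S3 does not.
providers, non_providers = {1, 2, 4, 5}, {3}
joint_recall = {frozenset({1, 2, 4, 5}): 0.22, frozenset({1, 2, 3, 4, 5}): 0.11}
joint_fpr = {frozenset({1, 2, 4, 5}): 0.22, frozenset({1, 2, 3, 4, 5}): 0.037}

p_t8 = exact_probability(providers, non_providers, joint_recall, joint_fpr)
print(round(p_t8, 3))
```

The number of dictionary lookups doubles with every additional non-providing source, which is the exponential cost that motivates the approximations.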


EXAMPLE 4.4. Triple $t_8$ of Figure 1a is provided by $\mathcal{S}_{t_8} = \{S_1, S_2, S_4, S_5\}$.
We use notations $r_{\{S_1,S_2,S_4,S_5\}}$ and $r_{1245}$ interchangeably.
We can compute joint recall for a set of sources as we
do for a single source (Section 3.2), but here we assume that all the
joint recall and joint false positive rate parameters are given.

We compute $\Pr(\mathcal{O}_{t_8} \mid t_8)$ and $\Pr(\mathcal{O}_{t_8} \mid \neg t_8)$ according to Eqs. (10) and (11):

$$\Pr(\mathcal{O}_{t_8} \mid t_8) = r_{1245} - r_{12345} = 0.22 - 0.11 = 0.11$$
$$\Pr(\mathcal{O}_{t_8} \mid \neg t_8) = q_{1245} - q_{12345} = 0.22 - 0.037 = 0.185$$

Assuming a-priori probability $\alpha = 0.5$, we derive $\Pr(t_8 \mid \mathcal{O}) = \frac{1}{1 + \frac{0.185}{0.11}} = 0.37$.
Note that although $t_8$ is provided by four out of
the five sources, $S_1$, $S_4$, and $S_5$ are correlated, which reduces their
contribution to the correctness probability of $t_8$. Using correlations
allows us to correctly classify $t_8$ as false, whereas the independence
assumption leads to the wrong result, as shown in Example 3.3.

Even though accounting for correlations can significantly improve
accuracy, it increases the computational cost. The computation
of $\Pr(\mathcal{O}_t \mid t)$ and $\Pr(\mathcal{O}_t \mid \neg t)$ is exponential in the number of
sources that do not provide $t$, thus impractical when we have a large
number of sources. We next describe two ways to approximate
$\Pr(\mathcal{O}_t \mid t)$ and $\Pr(\mathcal{O}_t \mid \neg t)$.


4.2 Aggressive approximation

In this section, we present a linear approximation that reduces
the total number of terms in the computation by enforcing a set of
assumptions. We first present the main result for the approximation
in Definition 4.5, and we show how we derive it later.

Definition 4.5 (Aggressive approximation). Given a
set of sources $\mathcal{S} = \{S_1, \ldots, S_n\}$, the recall $r_i$ and false positive
rate $q_i$ of each source $S_i$, and the joint recall and joint false positive
rate for sets $\mathcal{S}$ and $\mathcal{S} - \{S_i\}$, the aggressive approximation of the
probability $\Pr(t \mid \mathcal{O}_t)$ is defined as $\frac{1}{1 + \frac{1-\alpha}{\alpha} \cdot \frac{1}{\mu_{aggr}}}$, where

$$\mu_{aggr} = \prod_{S_i \in \mathcal{S}_t} \frac{C_i^+ r_i}{C_i^- q_i} \prod_{S_i \in \mathcal{S}_{\bar{t}}} \frac{1 - C_i^+ r_i}{1 - C_i^- q_i} \qquad (13)$$

$$C_i^+ = \frac{r_{12 \ldots n}}{r_i \cdot r_{12 \ldots (i-1)(i+1) \ldots n}} \qquad (14)$$

$$C_i^- = \frac{q_{12 \ldots n}}{q_i \cdot q_{12 \ldots (i-1)(i+1) \ldots n}} \qquad (15)$$

Eq. (13) differs from (8) in that it replaces $r_i$ (resp. $q_i$) with
$C_i^+ r_i$ (resp. $C_i^- q_i$). Intuitively, $C_i^+$ and $C_i^-$ represent the
correlation between $S_i$ and the rest of $\mathcal{S}$, in the case of true and false
triples respectively. Eq. (13) weighs $r_i$ and $q_i$ by these “correlation”
parameters. When the sources are independent, $C_i^+ = C_i^- = 1$,
and the approximation obtains the same result as Theorem 3.1. In
contrast with Definition 2.4, aggressive approximation only uses
$2n + 1$ instead of $2(2^n - n - 1)$ correlation parameters.

Corollary 4.6. Given a set $\mathcal{S} = \{S_1, \ldots, S_n\}$, where all
sources are independent, the correctness probabilities computed
using Theorem 3.1 and Definition 4.5 are the same.

EXAMPLE 4.7. Consider triple $t_8$ in Figure 1a. Figure 3 shows
the correlation parameters for each source and illustrates how they
are computed for $S_1$. These parameters indicate that $S_1$, $S_4$ and
$S_5$ are positively correlated for false triples; for true triples, $S_3$ is
anti-correlated with the rest of the sources, whereas $S_4$ and $S_5$ are
correlated. Accordingly, we compute $\mu_{aggr}$ as follows:

$$\mu_{aggr} = \frac{0.67 \cdot 0.5 \cdot (1 - 0.75 \cdot 0.67) \cdot 1.5 \cdot 0.67 \cdot 1.5 \cdot 0.67}{2 \cdot 0.5 \cdot 0.67 \cdot (1 - 0.167) \cdot 3 \cdot 0.33 \cdot 3 \cdot 0.33}$$

Thus, we compute $\Pr(t_8 \mid \mathcal{O}) = \frac{1}{1 + \frac{1}{\mu_{aggr}}} = 0.23$, which is
lower than the exact computation in Example 4.4. Both approaches
correctly determine that $t_8$ is false.

            S_1     S_2     S_3     S_4     S_5
    C_i^+   1       1       0.75    1.5     1.5
    C_i^-   2       1       1       3       3

    For $S_1$: $C_1^+ = \frac{r_{12345}}{r_1 \cdot r_{2345}} = 1$ and $C_1^- = \frac{q_{12345}}{q_1 \cdot q_{2345}} = \frac{0.037}{0.5 \cdot 0.037} = 2$.

Figure 3: Correlation parameters of the aggressive approximation
computed for each source of Figure 1a.

Obviously, the computation is linear in the number of sources.
Also, instead of having an exponential number of joint recall and
false positive rate values, we only need $C_i^+$ and $C_i^-$ for each $S_i$, $i \in [1, n]$,
which can be derived from a linear number of joint recall and
false positive rate values. However, as the following proposition
shows, this aggressive approach can produce bad results for special
cases with strong correlation (i.e., sources are replicas), or strong
anti-correlation (i.e., sources are complementary to each other).

PROPOSITION 4.8. If all sources in $\mathcal{S}$ provide the same data,
Definition 4.5 computes probability $\alpha$ for each provided triple.

If every pair of sources in $\mathcal{S}$ are complementary to each other,
Definition 4.5 does not compute a valid probability for any triple.

Next, we proceed to describe the three major steps that lead to
Definition 4.5.
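The aggressive approximation of Definition 4.5 amounts to a single pass over the sources. The sketch below is illustrative rather than the authors' implementation: the function names are hypothetical, and the per-source recall, false positive rate, and correlation parameters are the rounded values that appear in the worked computation for triple $t_8$ (Example 4.7).

```python
def aggressive_mu(providers, recall, fpr, c_plus, c_minus):
    """mu_aggr: per-source rates are weighted by their correlation parameters."""
    mu = 1.0
    for s in recall:
        weighted_r = c_plus[s] * recall[s]
        weighted_q = c_minus[s] * fpr[s]
        if s in providers:
            mu *= weighted_r / weighted_q
        else:
            mu *= (1.0 - weighted_r) / (1.0 - weighted_q)
    return mu


def probability(mu, alpha=0.5):
    return 1.0 / (1.0 + (1.0 - alpha) / alpha / mu)


# Rounded values from the running example (sources S1..S5).
recall = {1: 0.67, 2: 0.5, 3: 0.67, 4: 0.67, 5: 0.67}
fpr = {1: 0.5, 2: 0.67, 3: 0.167, 4: 0.33, 5: 0.33}
c_plus = {1: 1.0, 2: 1.0, 3: 0.75, 4: 1.5, 5: 1.5}
c_minus = {1: 2.0, 2: 1.0, 3: 1.0, 4: 3.0, 5: 3.0}

mu_aggr = aggressive_mu({1, 2, 4, 5}, recall, fpr, c_plus, c_minus)
p_aggr = probability(mu_aggr)
print(round(mu_aggr, 2), round(p_aggr, 2))
```

The sketch does not guard against weighted rates of 1 or more; in that situation the ratio is no longer a valid likelihood ratio, which is the failure mode for complementary sources noted in Proposition 4.8.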


I. Correlation factors 

Accounting for correlations, the probability that a set $\mathcal{S}^*$ of sources
all provide a true triple is $r_{\mathcal{S}^*}$ instead of $\prod_{S_i \in \mathcal{S}^*} r_i$ (similarly for a
false triple). We define two correlation factors, $C_{\mathcal{S}^*}$ and $C^{\neg}_{\mathcal{S}^*}$:

$$C_{\mathcal{S}^*} = \frac{\Pr(\mathcal{S}^* \models t \mid t)}{\Pr_{indep}(\mathcal{S}^* \models t \mid t)} = \frac{r_{\mathcal{S}^*}}{\prod_{S_i \in \mathcal{S}^*} r_i} \qquad (16)$$

$$C^{\neg}_{\mathcal{S}^*} = \frac{\Pr(\mathcal{S}^* \models t \mid \neg t)}{\Pr_{indep}(\mathcal{S}^* \models t \mid \neg t)} = \frac{q_{\mathcal{S}^*}}{\prod_{S_i \in \mathcal{S}^*} q_i} \qquad (17)$$

If the sources in $\mathcal{S}^*$ are independent, then $C_{\mathcal{S}^*} = C^{\neg}_{\mathcal{S}^*} = 1$.
Deviation from independence may produce values greater than 1,
which imply positive correlations (e.g., for $S_4$ and $S_5$ in Figure 1a,
$C_{45} = \frac{0.67}{0.67 \cdot 0.67} = 1.5 > 1$), or lower than 1, which imply negative
correlations, also known as anti-correlations (e.g., for $S_1$, $S_3$ in
Figure 1a, $C_{13} = \frac{0.33}{0.67 \cdot 0.67} = 0.75 < 1$).

Using separate parameters for true triples and false triples allows
for a richer representation of correlations. In fact, two sources may
be correlated differently for true and false triples. For example,
sources $S_2$ and $S_3$ in Figure 1a are independent with respect to true
triples ($C_{23} = 1$), and negatively correlated with respect to false
triples ($C^{\neg}_{23} = 0.5 < 1$).

Using correlation factors, $\Pr(\mathcal{O}_{t_8} \mid t_8)$ in our running example can
be rewritten as follows:

$$\Pr(\mathcal{O}_{t_8} \mid t_8) = C_{1245} \, r_1 r_2 r_4 r_5 - C_{12345} \, r_1 r_2 r_3 r_4 r_5$$
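A correlation factor is just the ratio between an observed joint rate and the joint rate expected under independence. The short sketch below illustrates this; the function name `correlation_factor` is hypothetical, and the inputs assume that the rounded values 0.67 and 0.33 quoted for the running example stand for 2/3 and 1/3.

```python
from math import prod


def correlation_factor(joint_rate, individual_rates):
    """Joint rate divided by the product of individual rates (independence baseline)."""
    return joint_rate / prod(individual_rates)


# Assumed exact values behind the rounded 0.67 and 0.33 of the example.
c_45 = correlation_factor(2 / 3, [2 / 3, 2 / 3])  # positive correlation
c_13 = correlation_factor(1 / 3, [2 / 3, 2 / 3])  # anti-correlation
c_indep = correlation_factor(0.5 * 0.4, [0.5, 0.4])  # independent sources

print(c_45, c_13, c_indep)
```

The same function covers the false-triple factor when it is fed joint and individual false positive rates instead of recalls.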


II. Assumptions on correlation factors 

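To transform the equations with correlation factors into a simpler
form, we make partial independence assumptions. Before we formally
state the assumptions, we first illustrate them using an example.


EXAMPLE 4.9. Consider sources $\mathcal{S} = \{S_1, \ldots, S_5\}$ and assume
$S_4$ is independent of the set of sources $\{S_1, S_2, S_3\}$ and of sources
$\{S_1, S_2, S_3, S_5\}$. Then, we have $r_{123} \cdot r_4 = r_{1234}$ and $r_{12345} = r_{1235} \cdot r_4$.
Thus, $r_{123} \, r_{12345} = r_{1234} \, r_{1235}$. Using the definition of
the correlation factors, it follows that $C_{123} C_{12345} = C_{1234} C_{1235}$.

Accordingly, we can rewrite the correlation factors; for example
$C_{123} = \frac{C_{1234} C_{1235}}{C_{12345}}$ and similarly, $C_{23} = \frac{C_{1234} C_{1235} C_{2345}}{(C_{12345})^2}$.
Combining Eqs. (14) and (16), the following equations hold:
$C_4^+ = \frac{C_{12345}}{C_{1235}}$, $C_5^+ = \frac{C_{12345}}{C_{1234}}$, $C_1^+ = \frac{C_{12345}}{C_{2345}}$. Using these we can
rewrite the following correlation factors: $C_{123} = \frac{C_{12345}}{C_4^+ C_5^+}$ and
$C_{23} = \frac{C_{12345}}{C_1^+ C_4^+ C_5^+}$.


According to Eqs. (14-17), we can compute $C_i^+ = \frac{C_{\mathcal{S}}}{C_{\mathcal{S} \setminus \{S_i\}}}$ and
$C_i^- = \frac{C^{\neg}_{\mathcal{S}}}{C^{\neg}_{\mathcal{S} \setminus \{S_i\}}}$. As illustrated in Example 4.9, partial independence
assumptions lead to the following equations:

$$C_{\mathcal{S}^*} = \frac{C_{\mathcal{S}}}{\prod_{S_i \in \mathcal{S} \setminus \mathcal{S}^*} C_i^+} \quad \text{and} \quad C^{\neg}_{\mathcal{S}^*} = \frac{C^{\neg}_{\mathcal{S}}}{\prod_{S_i \in \mathcal{S} \setminus \mathcal{S}^*} C_i^-} \qquad (18)$$

As a special form, when $\mathcal{S}^* = \emptyset$, we have $C_{\mathcal{S}^*} = C^{\neg}_{\mathcal{S}^*} = 1$, so,

$$C_{\mathcal{S}} = \prod_{S_i \in \mathcal{S}} C_i^+ \quad \text{and} \quad C^{\neg}_{\mathcal{S}} = \prod_{S_i \in \mathcal{S}} C_i^- \qquad (19)$$

Under these assumptions, $\Pr(\mathcal{O}_{t_8} \mid t_8)$ in our running example can
be rewritten as follows:

$$\Pr(\mathcal{O}_{t_8} \mid t_8) = \frac{C_{12345}}{C_3^+} \, r_1 r_2 r_4 r_5 - C_{12345} \, r_1 r_2 r_3 r_4 r_5$$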
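The identity relating each per-source parameter to the set-level correlation factors follows in one step from the definitions. The derivation below is a sketch that assumes $C_i^+$ is defined as the joint recall of all sources divided by $r_i$ times the joint recall of the remaining sources, and that $C_{\mathcal{S}^*}$ is the joint recall of $\mathcal{S}^*$ divided by the product of its members' recalls; the false-triple case is identical with $q$ in place of $r$.

```latex
\begin{align*}
C_i^{+}
  &= \frac{r_{\mathcal{S}}}{r_i \cdot r_{\mathcal{S}\setminus\{S_i\}}}
  && \text{(definition of } C_i^{+}\text{)} \\
  &= \frac{C_{\mathcal{S}} \prod_{S_j \in \mathcal{S}} r_j}
          {r_i \cdot C_{\mathcal{S}\setminus\{S_i\}}
           \prod_{S_j \in \mathcal{S}\setminus\{S_i\}} r_j}
  && \text{(substituting } r_{\mathcal{S}^*} = C_{\mathcal{S}^*} \textstyle\prod_{S_j \in \mathcal{S}^*} r_j\text{)} \\
  &= \frac{C_{\mathcal{S}}}{C_{\mathcal{S}\setminus\{S_i\}}}
  && \text{(the individual recalls cancel)}
\end{align*}
```

For five sources this gives, for instance, $C_4^{+} = C_{12345} / C_{1235}$, the form used in Example 4.9.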


III. Transformation 

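We are ready to transform the equation into a simpler form, which
is the same as the one in Definition 4.5. We continue illustrating the
main intuition with our running example:

$$\Pr(\mathcal{O}_{t_8} \mid t_8) = \frac{C_{12345}}{C_3^+} \, r_1 r_2 r_4 r_5 \, (1 - C_3^+ r_3)$$

$$= \frac{C_1^+ C_2^+ C_3^+ C_4^+ C_5^+}{C_3^+} \, r_1 r_2 r_4 r_5 \, (1 - C_3^+ r_3)$$

$$= C_1^+ r_1 \cdot C_2^+ r_2 \cdot C_4^+ r_4 \cdot C_5^+ r_5 \cdot (1 - C_3^+ r_3) \qquad (20)$$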


4.3 Elastic approximation 

So far we have presented two solutions: the exact solution gives
precise probabilities but is computationally expensive; the aggressive
approximation enforces partial independence assumptions, resulting
in linear complexity, but in the worst case can compute
probabilities independent of the quality of the sources. In this
section, we present an elastic approximation algorithm that makes a
tradeoff between efficiency and quality.

The key idea of the elastic approximation is to use the linear 
approximation as a starting point and gradually adjust the results 
by relaxing the assumptions in every step. We call the algorithm 
“elastic” because it can be configured to iterate over different levels 
of adjustments, depending on the desired level of approximation. 
We illustrate this idea with our running example. 


EXAMPLE 4.10. Triple $t_8$ is provided by four sources $\mathcal{S}_{t_8} = \{S_1, S_2, S_4, S_5\}$
(Figure 1a). We will adjust the linear approximation
of $\Pr(\mathcal{O}_{t_8} \mid t_8)$ from Eq. (20), by adding specific terms
at every level. We refer to the degree of a term in the aggressive
approximation as the number of recall (or false positive rate)
parameters associated with that term. The aggressive approximation
for $\Pr(\mathcal{O}_{t_8} \mid t_8)$ contains two terms of degrees 4 and 5:
$C_1^+ C_2^+ C_4^+ C_5^+ \, r_1 r_2 r_4 r_5$ and $C_1^+ C_2^+ C_3^+ C_4^+ C_5^+ \, r_1 r_2 r_3 r_4 r_5$ respectively
(directly derived from Eq. (20)).

Elastic approximation makes corrections to the aggressive
approximation based on terms of a given degree at every level. At
level-0 we consider the terms with degree $|\mathcal{S}_{t_8}| + 0 = 4$, i.e.,
the term $C_1^+ C_2^+ C_4^+ C_5^+ \, r_1 r_2 r_4 r_5$; the exact coefficient of the term
is $C_{1245}$ but we approximated it to $C_1^+ C_2^+ C_4^+ C_5^+$ based on the
assumption that $C_{1245} = \frac{C_{12345}}{C_3^+} = C_1^+ C_2^+ C_4^+ C_5^+$. To remove the
assumption, we need to replace $(C_1^+ r_1)(C_2^+ r_2)(C_4^+ r_4)(C_5^+ r_5)$ with
$C_{1245} \, r_1 r_2 r_4 r_5 = r_{1245}$. Since $r_{1245} = q_{1245} = 0.22$, we have

$$\mu = \frac{0.22}{0.22} \cdot \frac{1 - 0.75 \cdot 0.67}{1 - 0.167} = 0.6$$

Note that the level-0 adjustment affects not only terms with degree
4, but actually all terms as we show next.

At level-1, we consider the terms with degree $|\mathcal{S}_{t_8}| + 1 = 5$.
After level-0 adjustment, the 5-degree term is $C_{1245} C_3^+ \, r_1 r_2 r_3 r_4 r_5$.
We will replace $C_{1245} C_3^+$ with the exact coefficient $C_{12345}$, which
will now give us the exact solution.

In summary, the $\mu$ parameter calculated by the aggressive
approximation, the level-0 adjustment, and the level-1 adjustment
are 0.3, 0.6, and 0.59 respectively. Note that, as is the case in this
example, we don’t need to compute all the levels; stopping after a
constant number of levels can get close to the exact solution.


Algorithm 1 ELASTIC (Elastic approximation)

1: $R \leftarrow r_{\mathcal{S}_t} \prod_{S_i \in \mathcal{S}_{\bar{t}}} (1 - C_i^+ r_i)$;
2: $Q \leftarrow q_{\mathcal{S}_t} \prod_{S_i \in \mathcal{S}_{\bar{t}}} (1 - C_i^- q_i)$;
3: for $l = 1 \rightarrow \lambda$ do        ▷ $\lambda \ge 1$ is the desired adjustment level
4:     for all subsets $\mathcal{S}^* \subseteq \mathcal{S}_{\bar{t}}$ of size $l$ do
5:         $\mathcal{S}_l \leftarrow \{\mathcal{S}_t \cup \mathcal{S}^*\}$;
6:         $R \leftarrow R + (-1)^l \big( C_{\mathcal{S}_l} - C_{\mathcal{S}_t} \prod_{S_i \in \mathcal{S}^*} C_i^+ \big) \prod_{S_i \in \mathcal{S}_l} r_i$;
7:         $Q \leftarrow Q + (-1)^l \big( C^{\neg}_{\mathcal{S}_l} - C^{\neg}_{\mathcal{S}_t} \prod_{S_i \in \mathcal{S}^*} C_i^- \big) \prod_{S_i \in \mathcal{S}_l} q_i$;
8: return $\frac{R}{Q}$;


Our ELASTIC algorithm (Algorithm 1) contains the pseudo code
of our elastic approximation. Lines 1-2 compute the initial values
of the numerator $R$ and denominator $Q$ for $\mu$. Note that they have
already applied the level-0 adjustment. Then for each level $l$ from
1 up to the required level $\lambda$ (line 3), we consider each term with
degree $|\mathcal{S}_t| + l$ (lines 4-5), and make up the difference between the
exact coefficient and the approximate coefficient (lines 6-7). Finally,
line 8 returns $\frac{R}{Q}$ as the value of $\mu$.

Proposition 4.11. Given a set of $n$ sources, a set of $m$ triples
for probability computation, and an approximation level $\lambda$, ELASTIC
takes time $O(m \cdot n^{\lambda})$ and the number of required correlation
parameters is in $O(m \cdot n^{\lambda})$.
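The elastic approximation can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' Java implementation: the function names and the dictionary layout are hypothetical, the coefficient of a source set is taken to be its joint rate divided by the product of individual rates, and the inputs are the rounded values of the running example for triple $t_8$. With these rounded inputs the level-1 value comes out near 0.60 rather than the reported 0.59, presumably because of rounding in the printed parameters.

```python
from itertools import combinations
from math import prod


def elastic_side(providers, non_providers, rate, joint, c, level):
    """Numerator R (with recalls) or denominator Q (with false positive rates)."""
    s_t = frozenset(providers)
    c_st = joint[s_t] / prod(rate[s] for s in s_t)  # exact coefficient of S_t
    # Level-0 start: exact joint rate of the providers, aggressive factors elsewhere.
    total = joint[s_t] * prod(1 - c[s] * rate[s] for s in non_providers)
    for l in range(1, level + 1):
        for subset in combinations(sorted(non_providers), l):
            s_l = s_t | frozenset(subset)
            rates = prod(rate[s] for s in s_l)
            exact_coeff = joint[s_l] / rates
            approx_coeff = c_st * prod(c[s] for s in subset)
            total += (-1) ** l * (exact_coeff - approx_coeff) * rates
    return total


def elastic_mu(providers, non_providers, recall, fpr,
               joint_recall, joint_fpr, c_plus, c_minus, level):
    r_side = elastic_side(providers, non_providers, recall, joint_recall, c_plus, level)
    q_side = elastic_side(providers, non_providers, fpr, joint_fpr, c_minus, level)
    return r_side / q_side


recall = {1: 0.67, 2: 0.5, 3: 0.67, 4: 0.67, 5: 0.67}
fpr = {1: 0.5, 2: 0.67, 3: 0.167, 4: 0.33, 5: 0.33}
c_plus = {1: 1.0, 2: 1.0, 3: 0.75, 4: 1.5, 5: 1.5}
c_minus = {1: 2.0, 2: 1.0, 3: 1.0, 4: 3.0, 5: 3.0}
joint_recall = {frozenset({1, 2, 4, 5}): 0.22, frozenset({1, 2, 3, 4, 5}): 0.11}
joint_fpr = {frozenset({1, 2, 4, 5}): 0.22, frozenset({1, 2, 3, 4, 5}): 0.037}

args = ({1, 2, 4, 5}, {3}, recall, fpr, joint_recall, joint_fpr, c_plus, c_minus)
mu_level0 = elastic_mu(*args, level=0)
mu_level1 = elastic_mu(*args, level=1)
print(round(mu_level0, 3), round(mu_level1, 3))
```

Because only one source does not provide the triple here, level 1 already coincides with the inclusion-exclusion result; with more non-providers, each extra level adds the subsets of the next size.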


5. EVALUATION 

This section describes a thorough evaluation of our models on 
three real-world datasets as well as synthetic data. Our experimental 
results show that (1) considering correlations between sources can 
significantly improve fusion results; (2) our elastic approximation 
can effectively estimate triple probability with much shorter execution
time; and (3) even in presence of only independent sources, our
model can outperform state-of-the-art data fusion approaches. 

Datasets 

We first describe the real-world datasets we used in our experiments; 
we describe our synthetic data generation in Section 5.2. 

ReVerb: The ReVerb ClueWeb Extraction dataset [11] samples
500 sentences from the Web using Yahoo’s random link service and
uses 6 extractors to extract triples from these sentences. The gold
standard contains 2407 extracted triples (616 true and 1791 false).

Restaurant: The restaurant dataset from [17] consists of triples on
the location of a collection of 1000 restaurants provided by 7 sources
(Yelp, Foursquare, OpenTable, MechanicalTurk, YellowPages,
CitySearch, MenuPages). The gold standard contains 93 triples (68 true
and 25 false), selected by majority vote over 10 Mechanical Turk
responses.


Book: The book dataset from [6] was collected by crawling
abebooks.com. The dataset consists of 5900 unique book-author triples
from 879 seller sources. The gold standard consists of 225 randomly 
sampled books for which the authors are manually identified from 
book covers; 482 authors are correctly provided for these books and 
935 authors are wrongly provided. Note that our version of this
dataset has more noise than the one used in [20], resulting in a more
challenging setting.

We observe that these datasets display varied characteristics:
the sources in Restaurant all have high precision, and most
have high recall; the sources in ReVerb have fairly low precision
and recall; the sources in Book have large variations in
precision, and most of them have low recall. Such differences
allow us to evaluate our models in a variety of scenarios.


Comparisons 

We compared our models with several state-of-the-art techniques 
that apply to the independent-triple and open-world semantics. 

Union-K: Considers a triple to be true if at least K% of the sources
provide it. Union-50 is equivalent to majority voting.

3-Estimate [13]: Iteratively computes trustworthiness of sources, 
trustworthiness of triples, and truthfulness of triples. This is the best 
model among the three proposed in [13], and we observed similar 
results from the other two models on our datasets. 

LTM [25]: Constructs a graphical model and uses Gibbs sampling 
to determine source quality and truthfulness of each triple. We used 
the default parameters suggested by [25]. 

PrecRec (Section 3): Computes truthfulness of each triple from the
precision and recall of each source. We set $\alpha = 0.5$ and computed
source precision and recall according to the gold standard.

PrecRecCorr (Section 4): Extends PrecRec by considering
correlation between sources. By default we report the results for the exact
solution; however, as we show in Figure 5, we obtain similar results
using level-3 elastic approximation. We computed joint precision
and recall according to the gold standard. Note that Book is
considerably larger than the other two datasets, which poses challenges for
deriving the correlation parameters: (a) the number of correlation
parameters is very large, and (b) there may not be enough support
data to understand the correlation among the sources. We overcome
this issue using a simple clustering approach: we divide sources
into clusters based on their pairwise correlations, and assume that
sources across clusters are independent.
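Of the baselines listed above, Union-K is simple enough to sketch directly. The snippet below is a minimal illustration; the function name `union_k`, the mapping from triples to provider sets, and the toy data are assumptions, not part of the evaluated implementations.

```python
def union_k(provided_by, num_sources, k):
    """Return the triples provided by at least k percent of the sources."""
    threshold = k / 100.0 * num_sources
    return {triple for triple, sources in provided_by.items()
            if len(sources) >= threshold}


# Toy input: which of 4 hypothetical sources provide each triple.
provided_by = {"t1": {1, 2, 3}, "t2": {1}, "t3": {2, 4}}

accepted_50 = union_k(provided_by, num_sources=4, k=50)
accepted_25 = union_k(provided_by, num_sources=4, k=25)
print(sorted(accepted_50), sorted(accepted_25))
```

Lowering K admits more triples, which trades precision for recall.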

We used a C# implementation of LTM and we implemented the
other models in Java. For ReVerb, Restaurant, and synthetic data,
we ran experiments on a MacBook Air with 4GB RAM, a 1.7 GHz
Intel Core i5 processor, and OS X Lion 10.7.5. The Book experiments
were run on an m1.large Amazon EC2 server instance [1].


Metrics 

We present results according to three metrics. 

Precision/Recall/F1: We measure the correctness of binary decisions
with three metrics. Precision measures among the returned
true triples, how many are indeed true; recall measures among the
provided true triples, how many are returned; F-measure computes
their harmonic mean (i.e., $F1 = \frac{2 \cdot prec \cdot rec}{prec + rec}$).

PR-curve/ROC-curve: We rank the provided triples in decreasing
order of the computed truthfulness score (for Union-K, we rank
in decreasing order of the number of providers). As we add the
triples gradually, PR-curve plots the precision versus the recall after
adding each triple and ROC-curve plots the true positive rate versus
the false positive rate. In addition, we compute the area under the
curve, called AUC-PR and AUC-ROC respectively. These curves and
measures allow us to examine whether the correctness probabilities
we compute are consistent with the reality.

Execution time: We report execution time for each method. 
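The binary-decision metrics described above can be computed from two sets of triples. The sketch below is illustrative only; the function name and the toy triples are assumptions, and returning 0 when a denominator is empty is a convention of this sketch.

```python
def precision_recall_f1(returned_true, gold_true):
    """Precision, recall and F1 of the triples a method returns as true."""
    true_positives = len(returned_true & gold_true)
    precision = true_positives / len(returned_true) if returned_true else 0.0
    recall = true_positives / len(gold_true) if gold_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three triples returned as true, four triples true in the gold standard.
prec, rec, f1 = precision_recall_f1({"a", "b", "c"}, {"b", "c", "d", "e"})
print(round(prec, 3), round(rec, 3), round(f1, 3))
```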

5.1 Real-World Data 

We first compare the different models on the three real-world data 
sets. Figure 4 reports the precision, recall, and F-measure of each 
method on each dataset. We also plot the PR-curve and ROC-curve 
of the methods on each data set. Note that the curves for Union-K
of different K are the same so we plot only one; also note that the 
results of 3-Estimate are significantly worse than other methods, 
so we did not plot its curves to avoid cluttering. 

Overall, we observe that among different datasets, most of the 
methods obtain higher quality results on Restaurant and Book, 
but lower quality on ReVerb. This is not surprising given that the 
data sources in ReVerb have fairly low precision and recall and 
they extract a lot of wrong triples. PrecRecCorr obtains the best 
results on all datasets: comparing with PrecRec, its F-measure is 
5.2% higher on average, its AUC-PR is 10.3% higher on average, 
and its AUC-ROC is 3.3% higher on average. We note that although 
the improvement on F-measure is not that large, the improvement 
for AUC-PR and AUC-ROC is significant; this is because with 
consideration of correlations between the sources, we often compute 
a much higher probability for a true triple and a lower probability 
for a false triple, but this difference may be hindered when we apply 
the threshold and make binary decisions. 

Among the methods that assume independence between sources, 
PrecRec obtains the best results: on average its F-measure is 14% 
higher than LTM and 41% higher than 3-Estimate. For LTM, its 
F-measure is comparable to PrecRec on Restaurant and Book, 
but much lower on ReVerb because of a very low precision. Its 
PR-curves and ROC-curves are not in a very good shape; indeed, its 
AUC-PR is 24% lower than PrecRec and its AUC-ROC is 20.8% 
lower on average. We observed that the probabilities it outputs typically
fall in extreme ranges; for example, for most of the triples that
it considers as true on Restaurant, it computes a probability very 
close to 1. 3-Estimate obtains very low recall in all of the three 
datasets; as a result, its F-measure is the lowest among all methods. 

For Union-K, increasing K increases the precision but drops the
recall. Union-25 turns out to have the best F-measure, comparable
to PrecRec on each data set, but lower than PrecRecCorr. However,
its PR-curves and ROC-curves are in slightly worse shapes
compared with PrecRec; indeed, its AUC-PR and AUC-ROC are
lower than those of PrecRec by up to 4.5%. As we show later on
synthetic data, Union-K is sensitive to source quality; for example,
even Union-25 can obtain very low F-measure when the sources
have low precision or low recall.

Figure 5b shows the execution time of the different models. 









(a) Fusion results, and Precision-Recall and ROC curves for the ReVerb data set.
(b) Fusion results, and Precision-Recall and ROC curves for the Restaurant data set.
(c) Fusion results, and Precision-Recall and ROC curves for the Book data set.


Figure 4: Our experiments show that PrecRec and PrecRecCorr result in better fusion results compared to other approaches. In 
the ReVerb dataset both PrecRec and PrecRecCorr showed significant improvement in the F-measure compared to the state of
the art (3-Estimate, and LTM). In the Restaurant and Book datasets, LTM and Union-25 are comparable to the results of PrecRec,
but the PR and ROC curves demonstrate that PrecRecCorr provides significantly better truthfulness estimates for triples. 


Union-K is very efficient, while 3-Estimate and PrecRec are
the next most efficient, with runtimes up to one order of magnitude
longer than Union-K. We terminated LTM after 10 iterations; each
iteration on average took 5.6 times longer than PrecRec.
PrecRecCorr is one order of magnitude slower than PrecRec on average;
however, the level-3 elastic approximation obtained similar results
but finished in only half of the time. For our largest dataset (Book),
level-3 approximated the exact solution in 40 minutes; we consider
these runtimes reasonable, since this is an offline cleaning
process. Parallelization can significantly improve the efficiency of
PrecRecCorr, as the terms at different levels and across different
clusters can be computed independently. With maximum parallelization,
PrecRecCorr terminates in 80 seconds; however, a systematic
study of these improvements is outside the scope of this paper.

Elastic approximation: Figure 5 demonstrates the behavior of our 
aggressive approximation and elastic approximation (Algorithm 1) 
over the three datasets. We observe that the aggressive estimate is 
much worse than the exact solution on ReVerb and Restaurant,
while comparable on Book; it is even worse than PrecRec, which
does not consider correlation. Each line in the graph shows the 
progression of the approximation from the aggressive estimate to the 
exact computation. At every level, the elastic approximation refines 
the probability estimates of the earlier levels to gradually approach 
PrecRecCorr. Since the elastic approximation is heuristic in nature, 
there is no guarantee that the method improves the estimate with 
every level (e.g., on ReVerb the elastic approximation performs 
worse at level 2 than level 1). However, for all datasets, the elastic 
approximation comes close to the exact result within a small number 
of levels. We observe that on all three data sets, the result of level-3 
approximation is already quite close to that of the exact solution, 
whereas the execution time is much shorter. 

Discovered correlations: To better understand the improvement 
of PrecRecCorr over PrecRec, we examine in more detail the 
discovered correlations between the sources. 

ReVerb has 6 sources. With respect to true triples, we detect
strong correlation on a group of 2 sources and on a group of 3
sources. With respect to false triples, 2 pairs of sources are strongly
correlated, and one source is strongly anti-correlated with every
other source. Of the 7 Restaurant sources, we detect strong
correlation on a group of 4 sources and fairly strong anti-correlation on a
pair of sources, with respect to true triples. For false triples, there is
strong correlation on a group of 6 sources. Finally, for Book, there
are 333 sources that provide triples in the gold standard. Recall that
we cluster the sources according to their correlation. In terms of true
triples, we obtain three clusters of size 22, 3, and 2. In terms of false
triples, we obtain four clusters of size 22, 3, 2, and 2. Interestingly,
except two sources between which we find strong correlation both
on true triples and on false triples, the clusters for true triples and
for false triples contain very different sources.


(a) Elastic approximation levels (series: ReVerb, Restaurant, Book).
(b) Runtimes of algorithms (in seconds) for all datasets.

Figure 5: As expected, our elastic approximation gradually approaches the result of PrecRecCorr.


(a) Low precision sources, with low to fair recall, in a dataset of 25% true triples.
(b) High precision sources, with increasing recall, in a dataset of 50% true triples.
(c) Low recall sources, with increasing precision, in a dataset of 25% true triples.

Figure 6: Experimental results on synthetic data with independent sources. Our techniques are particularly effective with sources of
low quality, and demonstrate significant gains in many configurations.

These observations indicate that our model of correlation is much 
richer than what can be captured by pure copying relationships, as 
in [6]. For our datasets, [6] applies only to the Book dataset by considering
the author list as a whole, but not the other datasets. In Book,
this approach achieves high precision of 0.97 as it successfully detects
copying and reduces the vote counts of false values. However,
it has a low recall of 0.82, since it also discounts vote counts on true 
values and ignores other types of correlations. We leave an effective 
combination of that approach and ours for future work. 

5.2 Synthetic Data 

We generated synthetic data to evaluate our algorithms under a 
large range of scenarios; in this section we present interesting cases 
that arise both in the case of independent sources, as well as in the 
case of correlations. 

Our first set of experiments compares the different models on 


independent sources. We generated 5 sources providing data on 
1000 triples according to a pre-configured precision and recall; we 
averaged 10 repetitions and show the results in Figure 6. Our results 
show that even without correlations, PrecRec provides significant 
improvements over existing approaches, while PrecRecCorr has 
similar performance. Figure 6a shows the performance of all the
algorithms on a dataset of low-quality sources. LTM is quite
robust to variations in source quality and performs well in this
challenging setting; however, it does not benefit much from increases
in source quality, and PrecRec quickly becomes better once recall
exceeds 0.15. In Figures 6b and 6c, we vary recall and
precision respectively, while keeping the other constant. In both 
cases, our techniques perform remarkably well in comparison to the 
other algorithms. Note that Union-25 is very sensitive to source 
quality and performs badly with low-quality sources. 
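
As an illustration of this setup, the following Python sketch draws
independent sources with a target precision and recall. It is not the
generator used in the experiments: the 500/500 split between true and
false triples, the per-triple sampling scheme, and the names make_source,
precision_recall, and threshold_vote are assumptions made for this example.
The threshold vote further assumes that Union-25 accepts a triple when at
least 25% of the sources provide it.

```python
import math
import random


def make_source(true_ids, false_ids, precision, recall, rng):
    """Draw one independent source; precision (> 0) and recall hold in expectation."""
    # Each true triple is provided with probability `recall`.
    # Each false triple is provided with probability q, chosen so that
    #   recall*T / (recall*T + q*F) = precision
    # where T and F are the numbers of true and false triples.
    q = recall * len(true_ids) * (1 - precision) / (precision * len(false_ids))
    if not 0 <= q <= 1:
        raise ValueError("target precision/recall not reachable with this split")
    provided = {t for t in true_ids if rng.random() < recall}
    provided |= {t for t in false_ids if rng.random() < q}
    return provided


def precision_recall(provided, true_ids):
    """Empirical precision and recall of a set of provided triples."""
    hits = len(provided & true_ids)
    precision = hits / len(provided) if provided else 0.0
    return precision, hits / len(true_ids)


def threshold_vote(sources, fraction):
    """Accept a triple when at least `fraction` of the sources provide it."""
    needed = math.ceil(fraction * len(sources))
    counts = {}
    for provided in sources:
        for t in provided:
            counts[t] = counts.get(t, 0) + 1
    return {t for t, c in counts.items() if c >= needed}


rng = random.Random(0)
true_ids = set(range(500))         # assumed: half of the 1000 triples are true
false_ids = set(range(500, 1000))  # assumed: the other half are false
sources = [make_source(true_ids, false_ids, 0.8, 0.5, rng) for _ in range(5)]

for s in sources:
    print(precision_recall(s, true_ids))
# Quality of the fused output under a 25% threshold vote.
print(precision_recall(threshold_vote(sources, 0.25), true_ids))
```

Varying the precision and recall arguments while holding the other fixed
mimics the kind of sweep described above; averaging over several seeds
would smooth out sampling noise.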

Our second set of experiments considers correlated sources. Figure 7
demonstrates two cases: (1) four sources are positively
correlated on true triples, and (2) the sources are negatively correlated
on false triples. In both cases, PrecRecCorr demonstrates
significantly better performance than all the other approaches. 

6. RELATED WORK 

There has been extensive work in the area of data fusion (i.e.,
resolving conflicts and finding the truth); [4,8] surveyed early
approaches and [15] compared recent approaches on Deep Web data.
Among these approaches, [6,14,19,20,21,23,24] jointly infer truth
and source quality, but they assume the conflicting-triple,
closed-world semantics. COSINE and 3ESTIMATE [13] can be applied
under the independent-triple, open-world semantics. Instead of
using precision and recall of sources, they consider a single quality
metric, the accuracy of a source; we compared with them in our
experiments (Section 5). The model closest to ours is LTM [25]; we have
made detailed comparisons in Section 3 and in experiments. All of
these approaches assume independence between sources.

Figure 7: Experimental results on synthetic data with correlated
sources. PrecRecCorr obtains better results compared
to all other approaches.

Correlations between sources are studied in two bodies of work.
First, copy detection has been surveyed in [10] for various types
of data and studied in [3,5,6,7,16] for structured data. Our approach
differs in three aspects. First, in addition to copying, we
consider broader scopes of correlations, including positive correlations
not caused by copying (e.g., extractors employing common
extraction patterns), and also negative correlations. Second, instead
of just discounting votes from copiers, we may boost contributions
from providers correlated on true triples and reduce the penalty
from non-providers anti-correlated on true triples. Third, we assume
independent-triple and open-world semantics, as opposed to
their conflicting-triple, closed-world semantics. We have compared
with this approach in our experiments.

Second, there are other ways of measuring correlations. Qi
et al. [22] constructed a graphical model that clusters dependent
sources into groups and measures the quality of each group as a
whole (instead of each individual source). The Kappa measure [12]
measures correlation by taking into account agreement by chance.
We measure correlations by the joint precision and recall for subsets
of sources. Our measures have much higher expressiveness in that
(1) they consider both positive and negative correlations; (2) they
distinguish correlation on true data and on false data; and (3) they
essentially consider correlation for every subset of sources.
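
To make these measures concrete, the following Python sketch computes
joint precision and joint recall for a subset of sources on a toy labeled
set. It assumes that joint recall is the fraction of true triples provided
by every source in the subset, and that joint precision is the fraction of
triples provided by every source in the subset that are true. The toy
data, the function names, and the comparison of joint recall against the
product of individual recalls are illustrative assumptions, not the
paper's implementation.

```python
from math import prod


def joint_recall(subset, provides, gold):
    """Fraction of true triples that every source in `subset` provides."""
    true_ids = {t for t, is_true in gold.items() if is_true}
    common = set.intersection(*(provides[s] for s in subset))
    return len(common & true_ids) / len(true_ids)


def joint_precision(subset, provides, gold):
    """Fraction of triples provided by every source in `subset` that are true."""
    common = set.intersection(*(provides[s] for s in subset))
    if not common:
        return None  # undefined: the sources share no triple
    return sum(gold[t] for t in common) / len(common)


# Hypothetical gold standard: four true triples and two false ones.
gold = {"t1": True, "t2": True, "t3": True, "t4": True, "t5": False, "t6": False}
provides = {
    "A": {"t1", "t2", "t5"},
    "B": {"t1", "t2", "t6"},  # overlaps with A on true triples
    "C": {"t3", "t4", "t5"},  # complementary to A on true triples
}

for pair in (("A", "B"), ("A", "C")):
    jr = joint_recall(pair, provides, gold)
    # Joint recall expected if the sources were independent on true triples.
    independent = prod(joint_recall((s,), provides, gold) for s in pair)
    print(pair, joint_precision(pair, provides, gold), jr, jr / independent)
```

In this toy data, the joint recall of A and B is twice the product of
their individual recalls (positive correlation on true triples), whereas
A and C never agree on a true triple (negative correlation). The only
triple that A and C share is false, so their joint precision is zero,
which shows why correlation on true data and on false data is tracked
separately.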

7. CONTRIBUTIONS AND FUTURE WORK 

In this paper we presented a novel technique for fusing data that 
contains correlations, which uses Bayesian analysis to derive the 
truthfulness of a fact based on the quality of sources that provide it. 
We evaluated our approach against other state-of-the-art techniques, 
and showed that our algorithms achieve significant improvements in 
the fusion results. The power of our approach lies in its generality: 
our algorithms do not need to have any knowledge of possible 
correlations, and all required parameters can be computed from 
a training set. As a result, PrecRec and PrecRecCorr perform
well even on low-quality datasets that prove challenging for other
techniques. 
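
The kind of Bayesian analysis summarized above can be sketched in Python
for the independent-source case. This is a minimal illustration under
stated assumptions, not the exact formulation of Section 3: it assumes a
prior probability that a triple is true, treats each source's recall as
the probability of providing a true triple, and derives the probability
of providing a false triple from precision and recall by Bayes' rule. The
function names and numeric values are hypothetical.

```python
def false_positive_rate(precision, recall, prior):
    """Pr(source provides t | t is false), derived from precision and recall."""
    # By Bayes' rule:
    #   precision = prior*recall / (prior*recall + (1 - prior)*q)
    # Solving for q gives the expression below. The caller is responsible
    # for choosing values that keep q within [0, 1].
    return prior / (1 - prior) * (1 - precision) / precision * recall


def posterior_true(observations, prior):
    """Posterior that a triple is true, assuming independent sources.

    observations: list of (provided, precision, recall), one per source.
    """
    like_true, like_false = prior, 1 - prior
    for provided, precision, recall in observations:
        q = false_positive_rate(precision, recall, prior)
        like_true *= recall if provided else 1 - recall
        like_false *= q if provided else 1 - q
    return like_true / (like_true + like_false)


# Two sources of equal quality: one provides the triple, the other does not.
print(posterior_true([(True, 0.8, 0.5), (False, 0.8, 0.5)], prior=0.5))
```

With a single source that provides the triple, the posterior equals that
source's precision by construction; a non-providing source lowers the
posterior in proportion to its recall. Modeling correlations requires
replacing the per-source products with joint quantities over subsets of
sources, which this sketch does not attempt.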

There are still several interesting challenges in this problem. Our 
model uses independent-triple, open-world semantics, which allows 
our techniques to consider multiple truth values for an entity (e.g., a
person may have multiple professions). However, this assumption
may not always apply (e.g., a person only has a single birth date). We
will consider modifications to our model to account for such scenarios
in future work. Another challenge is that source quality may vary
based on the domain. For example, a source may have low overall
precision, but may be particularly accurate with respect to Pizzerias, 
or restaurants in the Bay Area. In our model, we can consider 
domains separately, but deriving the proper domain subdivisions 
automatically is not straightforward. 